Consistency and bias

GVPT722

What problem are we solving?

Building confidence in your inference from a finite number of random samples.

Data and packages

library(tidyverse)
library(poliscidata)
library(modelsummary)
library(broom)
library(ggdist)

Because we are working with randomness, we set a seed so that the results are reproducible:

set.seed(222)

Consistency

Consistency refers to how likely it is that different random samples from our population will produce similar estimates of our regression coefficients.

Consistency

What is the relationship between an individual’s feelings towards President Obama and their party affiliation?

poliscidata::nes |> 
  select(caseid, obama_therm, dem) |> 
  glimpse()
Rows: 5,916
Columns: 3
$ caseid      <dbl> 408, 3282, 1942, 118, 5533, 5880, 1651, 6687, 5903, 629, 1…
$ obama_therm <dbl> 15, 100, 70, 30, 70, 45, 50, 60, 15, 100, NA, 0, 45, 30, 4…
$ dem         <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Consistency

Fitting a linear regression to all respondents in the data, which we treat as our "population":

m <- lm(obama_therm ~ dem, data = nes)
 (1)
(Intercept) 44.245***
Democrat 41.061***
Num.Obs. 5474
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Random sample from our “population”

Let’s start with a random sample of 100 respondents:

nes_100 <- nes |> 
  sample_n(100) |> 
  select(caseid, obama_therm, dem)

glimpse(nes_100)
Rows: 100
Columns: 3
$ caseid      <dbl> 405, 5672, 372, 3522, 3882, 1437, 6178, 1148, 5379, 5953, …
$ obama_therm <dbl> 100, 85, 40, 100, 100, 100, NA, 60, 50, 100, 70, 100, NA, …
$ dem         <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0…

Learning from this sample

m <- lm(obama_therm ~ dem, data = nes_100)
 (1)
(Intercept) 47.268***
Democrat 44.192***
Num.Obs. 93
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Acknowledging randomness

Let’s take a different random sample of 100 respondents:

nes_100 <- nes |> 
  sample_n(100) |> 
  select(caseid, obama_therm, dem)

glimpse(nes_100)
Rows: 100
Columns: 3
$ caseid      <dbl> 6111, 148, 1913, 6246, 6640, 3983, 6779, 3650, 1452, 4943,…
$ obama_therm <dbl> 85, 100, 80, NA, 50, 100, NA, 70, 100, 50, 70, 70, 60, 85,…
$ dem         <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0…

Acknowledging randomness

Fitting the same model to this second sample:

m <- lm(obama_therm ~ dem, data = nes_100)
 (1)
(Intercept) 39.057***
Democrat 42.548***
Num.Obs. 96
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Acknowledging randomness

Let’s take 1,000 different random samples of 100 respondents.
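The repeated sampling described here could be sketched as follows. The helper `sample_coefs()` is a hypothetical name, not part of any package; it assumes a data frame with `obama_therm` and `dem` columns like `nes`, and the last lines only run if `poliscidata` is installed.

```r
# Hypothetical helper (not from any package): draw `n_samples` random samples
# of `size` rows from `data`, refit the model on each, and keep the estimates.
sample_coefs <- function(data, size, n_samples = 1000) {
  draws <- replicate(n_samples, {
    rows <- sample(nrow(data), size)                  # sample rows without replacement
    coef(lm(obama_therm ~ dem, data = data[rows, ]))  # lm() drops rows with missing values
  })
  # `draws` is a 2 x n_samples matrix with one row per coefficient
  data.frame(
    size = size,
    intercept = draws["(Intercept)", ],
    dem = draws["dem", ]
  )
}

if (requireNamespace("poliscidata", quietly = TRUE)) {
  set.seed(222)
  coefs_100 <- sample_coefs(poliscidata::nes, size = 100)
  print(summary(coefs_100$dem))  # spread of the 1,000 estimates of the dem coefficient
}
```

Each row of the result is one sample's pair of estimates, so a histogram or summary of the `dem` column shows how much a single sample of 100 can differ from the next.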

Building confidence in our one random sample

We can only take a finite number of random samples from our population.

  • How can we increase our confidence that the estimated coefficients produced by these random samples are close to the truth?

Building confidence in our one random sample

We can do this by increasing our sample size. The larger the sample size, the more consistent the estimates.

  • Let’s look at 1,000 different random samples of 300 respondents.

  • And 1,000 different random samples of 1,000 respondents.

Do we get more consistent estimates?
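One way to check this is to compare how much the estimates vary at each sample size. The sketch below is an illustration under assumptions: `dem_estimates()` and `spread_by_size()` are hypothetical helper names, the standard deviation is one possible measure of spread, and the final lines only run if `poliscidata` is installed.

```r
# Hypothetical helper: estimates of the dem coefficient from repeated
# random samples of `size` rows.
dem_estimates <- function(data, size, n_samples = 1000) {
  replicate(n_samples, {
    rows <- sample(nrow(data), size)
    coef(lm(obama_therm ~ dem, data = data[rows, ]))[["dem"]]
  })
}

# Standard deviation of the estimates at each sample size. A smaller value
# means the samples agree more closely with one another.
spread_by_size <- function(data, sizes = c(100, 300, 1000), n_samples = 1000) {
  spreads <- vapply(
    sizes,
    function(s) sd(dem_estimates(data, s, n_samples)),
    numeric(1)
  )
  names(spreads) <- sizes
  spreads
}

if (requireNamespace("poliscidata", quietly = TRUE)) {
  set.seed(222)
  print(spread_by_size(poliscidata::nes))
}
```

If larger samples give more consistent estimates, the printed spread should shrink as the sample size grows.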

Bias

A biased coefficient estimate will systematically be higher or lower than the true value.

What if we only sampled from males?

What happens to our understanding of the relationship between an individual’s feelings towards Obama and their party affiliation?

nes_men <- nes |> 
  filter(gender == "Male") |> 
  select(caseid, obama_therm, dem, gender)

glimpse(nes_men)
Rows: 2,847
Columns: 4
$ caseid      <dbl> 408, 3282, 1942, 118, 5533, 5880, 1651, 6687, 5903, 629, 1…
$ obama_therm <dbl> 15, 100, 70, 30, 70, 45, 50, 60, 15, 100, NA, 0, 45, 30, 4…
$ dem         <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ gender      <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male…

A consistent but biased estimate

Let’s take a random sample of 1,000 individuals from this male-only pool.

Males only
(Intercept) 43.575***
Democrat 41.641***
Num.Obs. 925
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
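A minimal sketch of how a sample like this could be drawn and fitted is shown below. `fit_restricted_sample()` is a hypothetical helper, not the code used to produce the table above, and because the state of the random seed at this step is unknown, its numbers will not necessarily match that table. The usage lines assume `poliscidata` is installed.

```r
# Hypothetical helper: restrict `data` to the rows where `keep` is TRUE,
# draw a random sample of `size` rows from that pool, and fit the model.
fit_restricted_sample <- function(data, keep, size) {
  pool <- data[which(keep), ]          # which() also drops rows where `keep` is NA
  rows <- sample(nrow(pool), size)     # sample rows without replacement
  lm(obama_therm ~ dem, data = pool[rows, ])
}

if (requireNamespace("poliscidata", quietly = TRUE)) {
  nes <- poliscidata::nes
  set.seed(222)
  m_men <- fit_restricted_sample(nes, nes$gender == "Male", size = 1000)
  print(coef(m_men))
  print(nobs(m_men))  # can be below 1,000 because lm() drops missing values
}
```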

A consistent but biased estimate

              Males only   All respondents   'True' relationship
(Intercept)   43.575***    43.102***         44.245***
Democrat      41.641***    41.505***         41.061***
Num.Obs.      925          934               5474
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
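A single sample cannot show whether a gap from the 'true' value is systematic or just sampling noise. One way to probe this, sketched here under assumptions, is to repeat the restricted sampling many times and compare the average estimate with the full-data estimate. `restricted_bias()` is a hypothetical helper, and treating the full data as the truth follows the framing used in these slides.

```r
# Hypothetical helper: average dem coefficient across repeated samples from a
# restricted pool, minus the dem coefficient from the full data (treated here
# as the "true" value). A result far from zero suggests systematic bias.
restricted_bias <- function(data, keep, size, n_samples = 1000) {
  truth <- coef(lm(obama_therm ~ dem, data = data))[["dem"]]
  pool <- data[which(keep), ]          # which() also drops rows where `keep` is NA
  estimates <- replicate(n_samples, {
    rows <- sample(nrow(pool), size)
    coef(lm(obama_therm ~ dem, data = pool[rows, ]))[["dem"]]
  })
  mean(estimates) - truth
}

if (requireNamespace("poliscidata", quietly = TRUE)) {
  nes <- poliscidata::nes
  set.seed(222)
  print(restricted_bias(nes, nes$gender == "Male", size = 1000))
}
```

Averaging over many samples removes most of the sample-to-sample noise, so what remains is the part of the gap that comes from excluding a group.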

Summary

  • We cannot take an infinite number of samples from our population

  • We can use our understanding of uncertainty to build confidence in estimates from a single random sample, or a finite number of them, drawn from our population (consistency)

  • We need to ensure that we are not excluding groups of observations from the population from which we draw those random samples (bias)

  • We aim to have consistent and unbiased estimates of our coefficients